eMailSift: Adapting Graph Mining Techniques for Email Classification

Authors

  • Manu Aery
  • Sharma Chakravarthy
Abstract

Text classification is the problem of assigning pre-defined class labels to incoming, unclassified documents. The class labels are defined based on a set of examples of pre-classified documents, used as a training corpus. A number of approaches have been proposed for text classification, including Support Vector Machines, decision trees, k-nearest-neighbor classification, Linear Least Squares Fit and Bayesian classification, among others. The need to handle and classify large amounts of personal email has prompted the use of text classification approaches for email classification. Email classification is trickier than text classification because it is based on personal preferences; consequently, it uses disparate criteria that are difficult to quantify. Also, documents are richer in content than emails, whose content can vary dramatically from folder to folder, so conventional approaches may not be well suited. In addition, as opposed to the static corpus typically used for training in text classification, the mail environment is constantly changing, which calls for adaptive and incremental re-training. In this report we propose a novel, graph-based mining approach for email classification. Our approach is based on the premise that representative (common and recurring) structures or patterns can be extracted from a pre-classified email folder and then used effectively to classify incoming email messages. A number of factors that influence representative structure extraction and classification are analyzed conceptually and validated experimentally. In our approach, the notion of inexact graph match is leveraged to derive structures that provide coverage for characterizing folder contents. Extensive experimentation validates the selection of parameters and the effectiveness of our approach for email classification. We also compare the performance of our approach with that of the Naive Bayesian classifier.
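
The premise above (extract recurring structures from each pre-classified folder, then match incoming mail against them) can be illustrated with a deliberately simplified sketch. It is not the authors' method: instead of substructure discovery with inexact graph match, it treats each email as a set of directed edges between adjacent words and keeps the edges that recur across a folder. The function names, the `min_support` parameter and the toy folders are all assumptions made for illustration.

```python
from collections import Counter
import re

def email_to_edges(text):
    """Turn an email body into a set of directed edges between adjacent words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return set(zip(words, words[1:]))

def representative_edges(folder_emails, min_support=0.6):
    """Keep edges that recur in at least `min_support` of the folder's emails.

    Stand-in for representative structure extraction (assumption: single
    edges only, exact match only).
    """
    counts = Counter()
    for text in folder_emails:
        # Each email contributes an edge at most once, so counts are per-email.
        counts.update(email_to_edges(text))
    threshold = min_support * len(folder_emails)
    return {edge for edge, n in counts.items() if n >= threshold}

def classify(text, folder_patterns):
    """Return the folder sharing the most edges with the email, or None."""
    edges = email_to_edges(text)
    best_folder, best_score = None, 0
    for folder, patterns in folder_patterns.items():
        score = len(edges & patterns)
        if score > best_score:
            best_folder, best_score = folder, score
    return best_folder

# Toy, made-up folders (not data from the report).
folders = {
    "travel": ["your flight booking is confirmed",
               "flight booking reminder for monday"],
    "billing": ["your invoice is attached",
                "invoice is attached for march"],
}
patterns = {name: representative_edges(emails) for name, emails in folders.items()}
predicted = classify("new flight booking details", patterns)
```

Here `predicted` is the folder whose recurring edges overlap most with the new message, and a message that shares no edge with any folder is left unclassified (`None`). Re-running `representative_edges` on a folder after new mail is filed is one simple way to picture the incremental re-training the abstract calls for.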

Similar resources

InfoSift: Adapting Graph Mining Techniques for Text Classification

Text classification is the problem of assigning pre-defined class labels to incoming, unclassified documents. The class labels are defined based on a set of examples of pre-classified documents used as a training corpus. Various machine learning, information retrieval and probability based techniques have been proposed for text classification. In this paper we propose a novel, graph mining appr...

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining: the unsupervised classification of similar documents into different groups. The most important step...

Credit scoring in banks and financial institutions via data mining techniques: A literature review

This paper presents a comprehensive review of the works done, during the 2000–2012, in the application of data mining techniques in Credit scoring. Yet there isn’t any literature in the field of data mining applications in credit scoring. Using a novel research approach, this paper investigates academic and systematic literature review and includes all of the journals in the Science direct onli...

Detection of Breast Cancer Progress Using Adaptive Nero Fuzzy Inference System and Data Mining Techniques

Prediction, diagnosis, recovery and recurrence of breast cancer among patients have always been among the most important challenges for researchers and scientists. Nowadays, with the help of bioinformatics, these challenges can be overcome by using previous information from patients' records. This paper uses an adaptive neuro-fuzzy inference system and data mining techniqu...

The application of data mining techniques in manipulated financial statement classification: The case of Turkey

Predicting financially false statements to detect frauds in companies has an increasing trend in recent studies. The manipulations in financial statements can be discovered by auditors when related financial records and indicators are analyzed in depth together with the experience of auditors in order to create knowledge to develop a decision support system to classify firms. Auditors may annot...

Publication date: 2004